Linking Databases using Matched Arabic Names
Author
Abstract
This paper presents a new hybrid algorithm that combines token-based and character-based approaches. The basic Levenshtein approach has also been extended to a token-based distance metric. The distance metric is enhanced to set the proper granularity level for the algorithm's behavior. It smoothly maps a threshold of misspelling differences at the character level and the importance of token-level errors in terms of token position and frequency. Experimental results on a large Arabic dataset show that the proposed algorithm successfully overcomes many types of errors, such as typographical errors, omission or insertion of middle name components, omission of non-significant popular names, and character variations across writing styles. When compared with other classical algorithms on the same dataset, the proposed algorithm was found to raise the minimum success level from the 69% of the best tested lower-limit algorithm (Soft TFIDF) to about 80%, while achieving an upper accuracy level of 99.67%.
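The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of a hybrid token-level and character-level distance, not the authors' method. It assumes that two tokens count as equal when their normalized character-level Levenshtein distance is within a threshold, and that inserting or dropping a token costs 1.0 unless a lower cost is supplied for it. The names `char_levenshtein`, `tokens_match`, `token_distance`, `char_threshold`, and `drop_cost` are hypothetical, and the position weighting mentioned in the abstract is not modeled.

```python
from typing import Dict, Optional, Sequence


def char_levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute
        prev = curr
    return prev[-1]


def tokens_match(t1: str, t2: str, char_threshold: float) -> bool:
    """Assumption: tokens are 'the same' if their normalized
    character distance does not exceed char_threshold."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        return True
    return char_levenshtein(t1, t2) / longest <= char_threshold


def token_distance(name_a: str, name_b: str,
                   char_threshold: float = 0.25,
                   drop_cost: Optional[Dict[str, float]] = None) -> float:
    """Levenshtein over whitespace-separated tokens instead of characters.

    drop_cost (hypothetical) lets frequent, low-information tokens be
    inserted or omitted more cheaply than the default cost of 1.0.
    """
    drop_cost = drop_cost or {}
    a, b = name_a.split(), name_b.split()
    prev = [0.0]
    for tb in b:
        prev.append(prev[-1] + drop_cost.get(tb, 1.0))
    for ta in a:
        curr = [prev[0] + drop_cost.get(ta, 1.0)]
        for j, tb in enumerate(b, start=1):
            sub = 0.0 if tokens_match(ta, tb, char_threshold) else 1.0
            curr.append(min(prev[j] + drop_cost.get(ta, 1.0),
                            curr[j - 1] + drop_cost.get(tb, 1.0),
                            prev[j - 1] + sub))
        prev = curr
    return prev[-1]


# Transliterated names are used here only for readability;
# the functions operate on any Unicode strings.
print(token_distance("mohamed ahmed ali", "mohammed ahmed ali"))
print(token_distance("mohamed ahmed ali", "mohamed ali"))
```

In this sketch a one-letter misspelling inside a token is absorbed by the character-level threshold, while a missing middle name shows up as a single token-level edit. Lowering the `drop_cost` of a very common name component makes its omission count for less, which is one simple way to reflect token frequency.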
Similar resources
Biodiversity informatics: the challenge of linking data and the role of shared identifiers
A major challenge facing biodiversity informatics is integrating data stored in widely distributed databases. Initial efforts have relied on taxonomic names as the shared identifier linking records in different databases. However, taxonomic names have limitations as identifiers, being neither stable nor globally unique, and the pace of molecular taxonomic and phylogenetic research means that a ...
Optimization of morphology and geometry of encapsulated Hypophthalmichthys molitrix oil
In the present study, the effect of stirring speed and the type of cross linking agent on the size and formation of microencapsulated Silver carp (Hypophthalmichthys molitrix) oil were investigated. The gelatin/gum Arabic was used for encapsulating and the capsules were prepared by complex coacervation. Microcapsules were analyzed by optical microscopy technique and particle size analyzer. Re...
Linked Data Driven Dynamic Web Services for Providing Multilingual Access to Diverse Japanese Humanities Databases
Several cultural domain resources in different languages have become available as Linked Open Data (LOD) in the last few years. However, there is little re-use of this data in multilingual information retrieval applications. The paper discusses Linked Data driven approaches in providing integrated multilingual access to diverse Japanese humanities databases by linking and re-using LOD resources...
Evaluation of advanced techniques for multi-party privacy-preserving record linkage on real-world health databases
The linking of multiple (three or more) health databases is challenging because of the increasing sizes of databases, the number of parties among which they are to be linked, and privacy concerns related to the use of personal data such as names, addresses, or dates of birth. This entails a need to develop advanced scalable techniques for linking multiple databases while preserving the privacy ...
A cascaded approach to normalising gene mentions in biomedical literature
Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to link gen...
Journal title:
- IJCLCLP
Volume 19, Issue
Pages -
Publication date 2014